Dimensionality Reduction Techniques for Document Clustering- A Survey
Authors
Abstract
Dimensionality reduction is applied to remove inessential terms, such as redundant and noisy terms, from documents. This paper presents a systematic study of seven dimensionality reduction methods: Latent Semantic Indexing (LSI), Random Projection (RP), Principal Component Analysis (PCA), CUR decomposition, Latent Dirichlet Allocation (LDA), Singular Value Decomposition (SVD), and Linear Discriminant Analysis (LDA).
Similar resources
Jonathan L. Elsas. An Evaluation of Projection Techniques for Document Clustering: Latent Semantic Analysis and Independent Component
Dimensionality reduction in the bag-of-words vector space document representation model has been widely studied for the purposes of improving accuracy and reducing computational load of document retrieval tasks. These techniques, however, have not been studied to the same degree with regard to document clustering tasks. This study evaluates the effectiveness of two popular dimensionality reduct...
A Systematic Study on Document Representation and Dimensionality Reduction for Text Clustering
Increasingly large text datasets and the high dimensionality associated with natural language are a great challenge for text mining. In this research, a systematic study is conducted of the application of three Dimension Reduction Techniques (DRT) to three different document representation methods in the context of the text clustering problem, using several standard benchmark datasets. The dimensional...
The Effect of Word Sampling on Document Clustering
Many document clustering techniques depend on the number of word occurrences in documents. In these techniques, words are considered as dimensions of the clustering space. Since a huge number of words is found in each document, studies were conducted to reduce this high dimensionality for better performance, i.e., word pruning. Sampling was used to choose random documents r...
Effective Dimension Reduction Techniques for Text Documents
Frequent term based text clustering is a text clustering technique that uses frequent term sets and dramatically decreases the dimensionality of the document vector space, thus addressing two problems in particular: the very high dimensionality of the data and the very large size of the databases. The Frequent Term based Clustering algorithm (FTC) has shown significant efficiency compared to some well-known text cluster...
Techniques for Spectral Clustering
Spectral techniques have of late been in the limelight in the machine learning community and have drawn the attention of many serious machine learners. They are being used in a variety of applications such as gene clustering, document analysis, image segmentation, and dimensionality reduction. They are very simple to understand and provide highly accurate results even for difficult clustering problems. ...